Linear Regression

In this chapter, you will learn:

  • Simple Linear Regression: A model that maps the relationship from a single explanatory variable to a continuous response variable with a linear model.
  • Multiple Linear Regression: A generalization of simple linear regression that maps the relationship from more than one explanatory variable to a continuous response variable.
  • Polynomial Regression: A special case of multiple linear regression that models nonlinear relationships.
  • Linear Regression Model Training: Finding the parameter values for the linear regression model by minimizing a cost function.

Simple Linear Regression

  • Assumption: A linear relationship exists between the response variable and the explanatory variable. Simple linear regression models this relationship with a flat surface called a hyperplane. A hyperplane is a subspace with one dimension less than the ambient space that contains it; with a single explanatory variable, the hyperplane is simply a line.
  • Task: Predict the price of a pizza
  • Explanatory Variable: Pizza size
  • Response Variable: Price

Data

Training Instance    Diameter (inches)    Price (dollars)
1                    6                    7
2                    8                    9
3                    10                   13
4                    14                   17.5
5                    18                   18

Visualizing the Data

We can use matplotlib to visualize our training data:


In [51]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('ggplot')

# X is the explanatory variable data structure
X = [[6], [8], [10], [14], [18]]

# Y is the response variable data structure
y = [[7], [9], [13], [17.5], [18]]

# instantiate a pyplot figure object
plt.figure()

plt.title('Figure 1. Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)
plt.show()


Based on the visualization above, we can see that there is a positive relationship between pizza diameter and price.

Training a Simple Linear Regression Model

We use scikit-learn to train our first model:


In [52]:
from sklearn.linear_model import LinearRegression

# Training Data
# X is the explanatory variable data structure
X = [[6], [8], [10], [14], [18]]

# Y is the response variable data structure
y = [[7], [9], [13], [17.5], [18]]

# Create the model
model = LinearRegression()

# Fit the model to the training data
model.fit(X, y)

# Make a prediction about how much a 12 inch pizza should cost
# (predict expects a 2D array-like: one row per sample, one column per feature)
test_X = [[12]]
prediction = model.predict(test_X)

print 'A 12" pizza should cost: $%.2f' % prediction[0][0]


A 12" pizza should cost: $13.68

The sklearn.linear_model.LinearRegression class is an estimator. Given a new value of the explanatory variable, an estimator predicts a response value. All estimators provide the fit() and predict() methods.

fit() is used to learn the parameters of a model, while predict() predicts the value of a response variable given an explanatory variable value.
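
For example, here is a minimal sketch (assuming the model fitted in the cell above) that uses the same predict() interface to estimate prices for several pizza sizes in one call; the sizes chosen here are arbitrary:

# Predict prices for several hypothetical pizza diameters at once
test_X = [[8], [12], [16]]
predictions = model.predict(test_X)
for size, price in zip(test_X, predictions):
    print '%d" pizza: $%.2f' % (size[0], price[0])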

The mathematical specification of a simple regression model is the following:

$$y = \alpha + \beta{x}$$

Where:

  • ${y}$: The predicted value of the response variable. In this case, the price of the pizza.
  • ${x}$: The explanatory variable. In this case, the diameter of the pizza in inches.
  • $\alpha$: The y-intercept term.
  • $\beta$: The coefficient term (i.e. the slope of the line).
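
As a quick check, scikit-learn exposes the fitted parameter values through the estimator's intercept_ and coef_ attributes. A minimal sketch, assuming the model fitted above:

# Inspect the fitted parameters: alpha is the intercept, beta is the slope
print 'alpha (intercept): %.4f' % model.intercept_[0]
print 'beta (coefficient): %.4f' % model.coef_[0][0]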

In [56]:
# instantiate a pyplot figure object
plt.figure()

# re-plot a scatter plot
plt.title('Figure 2. Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)

# create the line of fit
line_X = [[i] for i in np.arange(0, 25)]
line_y = model.predict(line_X)
plt.plot(line_X, line_y, '-b')
plt.show()


Training a model to learn the parameter values that yield the best unbiased estimator for simple linear regression is called ordinary least squares or linear least squares. To get a better idea of what makes an estimator "best" in the first place, let's define how we measure how well a model fits the training data.

Evaluating the Fitness of a Model with a Cost Function

How do we know whether the parameter values learned by a particular model fit the data well or poorly? In other words, how can we assess which parameters produce the best-fitting regression line?

Cost Function / Loss Function

A cost function, also called a loss function, measures the error of a model. To find the best-fitting regression line, the goal is to minimize the differences between the predicted prices and the corresponding observed prices of the pizzas in the training set. These differences are known as residuals or training errors.
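
As a small sketch (reusing the model fitted above), we can compute these residuals directly:

# Residuals: observed price minus predicted price for each training instance
residuals = np.array(y) - model.predict(X)
print residuals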

We can visualize the residuals by drawing a vertical line between each observed price and its predicted price. Fortunately, matplotlib provides the vlines() function, which takes x, ymin, and ymax arguments to draw a vertical line on a plot. We re-create Figure 2, this time adding the residuals.


In [58]:
# instantiate a pyplot figure object
plt.figure()

# re-plot a scatter plot
plt.title('Figure 3. Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)

# create the line of fit
line_X = [[i] for i in np.arange(0, 25)]
line_y = model.predict(line_X)
plt.plot(line_X, line_y, '-b')

# create residual lines
for x_i, y_i in zip(X, y):
    plt.vlines(x_i[0], y_i[0], model.predict([x_i])[0][0], colors='r')

plt.show()


Now that we can clearly see the prediction error (in red) made by our model (in blue), it's important to quantify the overall error through a formal definition of residual sum of squares.

We do this by summing the squared residuals for all of our training examples (we square the residuals because we don't care whether the error is in the positive or negative direction).

$$RSS = \sum_{i=1}^n\big(y_{i} - f(x_{i})\big)^2 $$

Where:

  • $y_{i}$ is the observed value
  • $f(x_{i})$ is the predicted value.

A related measure of model error is the mean squared error, which is simply the mean of the squared residuals:

$$MSE = \dfrac{1}{n}\sum_{i=1}^n\big(y_{i} - f(x_{i})\big)^2 $$

Let's go ahead and implement RSS and MSE using numpy:


In [64]:
import numpy as np
rss = np.sum((model.predict(X) - y) ** 2)
mse = np.mean((model.predict(X) - y) ** 2)

print 'Residual sum of squares: %.2f' % rss
print 'Mean squared error: %.2f' % mse


Residual sum of squares: 8.75
Mean squared error: 1.75
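
As an aside, scikit-learn provides a ready-made implementation of MSE in sklearn.metrics; a minimal sketch to cross-check our hand-computed value:

# Cross-check the MSE with scikit-learn's built-in metric
from sklearn.metrics import mean_squared_error
print 'Mean squared error (sklearn): %.2f' % mean_squared_error(y, model.predict(X))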

Now that we've defined the cost function, we can find the set of parameters that minimize the RSS or MSE.

Solving Ordinary Least Squares for Simple Linear Regression

Recall the equation for simple linear regression:

$$y = \alpha + \beta{x}$$

Goal:

Solve the values of $\beta$ and $\alpha$ such that they minimize the RSS cost function.

Solving for $\beta$

Step 1: Calculate the variance of $x$

Variance is a summary statistic that represents how spread out a set of values is. Intuitively, the variance of set A = {0, 5, 10, 15, 20} is greater than the variance of set B = {5, 5, 5, 5, 5}. The formal definition of variance is:

$$var(x) = \dfrac{\sum_{i=1}^{n}\big(x_{i} - \bar{x}\big)^2}{n - 1}$$

Where:

  • $\bar{x}$ is the mean of $x$
  • $x_{i}$ is the value of $x$ for the $i^{th}$ training instance
  • $n$ is the number of training instances

Let's implement variance in Python.


In [72]:
from __future__ import division

# calculate the mean
n = len(X)
xbar = sum([x[0] for x in X]) / n

# calculate the variance
variance = sum([(x[0] - xbar) ** 2 for x in X]) / (n - 1)

print 'Variance: %.2f' % variance


Variance: 23.20
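
For comparison, numpy can compute the same quantity; a small sketch using np.var with ddof=1 so that the denominator is n - 1 rather than n:

# Sample variance of the diameters via numpy (ddof=1 gives the n - 1 denominator)
print 'Variance (numpy): %.2f' % np.var([x[0] for x in X], ddof=1)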

Step 2: Calculate the covariance of $x$ and $y$

Covariance is a summary statistic that represents how two variables tend to change together. Suppose you have 3 sets:

  • $X = \big\{1, 2, 3, 4, 5\big\}$
  • $Y = \big\{100, 110, 120, 130, 140\big\}$
  • $Z = \big\{{-1}, {-2}, {-3}, {-4}, {-5}\big\}$

We can say that cov(X, Y) is positive and cov(X, Z) is negative. If there is no linear relationship between two variables, then their covariance will equal zero. For example, if we have a fourth set of values:

  • $W = \big\{{-1}, 1, {-1}, 1, {-1}\big\}$

Then cov(X, W) would be zero.
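
As a quick numerical check of these signs (using numpy's built-in np.cov rather than the formula itself; the variable names below are just for illustration and chosen so we don't clobber the pizza training data X and y):

# The [0, 1] entry of the covariance matrix is the covariance of the two sequences
# (np.cov uses the n - 1 denominator by default)
X_vals = [1, 2, 3, 4, 5]
Y_vals = [100, 110, 120, 130, 140]
Z_vals = [-1, -2, -3, -4, -5]
W_vals = [-1, 1, -1, 1, -1]
print 'cov(X, Y): %.2f' % np.cov(X_vals, Y_vals)[0][1]
print 'cov(X, Z): %.2f' % np.cov(X_vals, Z_vals)[0][1]
print 'cov(X, W): %.2f' % np.cov(X_vals, W_vals)[0][1]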

Let's do a sanity check on this intuition by implementing the formal definition of covariance:

$$ cov(x,y) = \dfrac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{n - 1} $$